The goal of our project was to better understand the Miami 311 data set categories Animal Bites to a Person and Pitbull Investigations through visualization techniques using R and Geographical Information System (GIS) software.
Our first thoughts were to analyze the variables Goal Days versus Actual Days Completed. However, the complexity of the data set made this a challenging task. After careful study of the levels in issue.type, the variable was narrowed to include only Animal Bite To A Person and Pit Bull Investigation. Initially, we thought there might be a relationship between the number of animal bites and pit bull investigations. For example, an area with a high level of animal bites would have a cluster of pit bull investigations. However, in the first stages of exploratory analysis the idea proved unfruitful and we opted to explore the data through visualization.
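The kind of check we had in mind can be sketched as follows. This is a hypothetical illustration with made-up district counts, not the real 311 data; the idea is simply that a correlation near 1 between per-area counts would suggest investigations cluster where bites are frequent.

```r
# Hypothetical per-district counts (synthetic, not the real 311 data)
bites <- c(D1 = 12, D2 = 30, D3 = 7, D4 = 18, D5 = 25)
pits  <- c(D1 = 5,  D2 = 9,  D3 = 6, D4 = 8,  D5 = 7)

# A correlation near 1 would suggest pit bull investigations
# cluster in the same areas as animal bites
cor(bites, pits)
```

In our actual exploration no such pattern emerged, which is why we shifted to visualization.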
Pitbulls are viewed as an aggressive breed that poses a danger to humans and other animals. Miami Dade County has legislation in place that puts specific restrictions on pitbulls. If the two categories of issue.type were correlated, we would better understand the implementation of pitbull restrictions.
The Miami Dade County Ordinance states that pitbulls must be confined, whether kept indoors or outdoors, because pitbulls are considered naturally inclined to attack humans and other animals. In addition, the owner must post a "Dangerous Dog" sign. If owners fail to comply with these rules, the dogs must be muzzled to prevent bites and injuries to others. They must also be kept on a leash. Exceptions to these rules are made for dogs participating in dog shows, contests, or hunting.
To begin exploring the Miami 311 data we downloaded the csv file from the Miami Dade County website ( https://opendata.miamidade.gov/311/311-Service-Requests-Miami-Dade-County/dj6j-qg5t ) and saved it to our working directories. Next, we imported the csv file to an R object called df. Downloading and importing may take some time because the dataset contains over 641,000 rows. After filtering for our selected variables, there were over 13,000 observations.
#df <- read.csv("311_Service_Requests_-_Miami-Dade_County.csv")
#head(df) # much too long to display in document
#str(df) # much too long to display in document
#names(df)
Calling str(df) displays the variables, their types, and the first few entries of each column. This data frame contains a mixture of categorical and numerical variables; categorical variables are indicated by Factor w/ n levels. In addition, calling names(df) will display the column names. This data frame consists of 23 columns.
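As a quick alternative to scanning the str() output, the categorical columns can also be listed programmatically. A small self-contained sketch with a toy data frame standing in for df (the real one has 23 columns):

```r
# Toy stand-in for df; stringsAsFactors = TRUE mimics older read.csv defaults
toy <- data.frame(issue.type = c("ANIMAL BITE", "PIT BULL"),
                  goal.days  = c(30, 90),
                  stringsAsFactors = TRUE)

# Which columns are factors (categorical)?
names(toy)[sapply(toy, is.factor)]

# How many levels each factor column has
sapply(toy[sapply(toy, is.factor)], nlevels)
```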
For the purpose of exploring the data we drew a random sample of size 50 from df. We called this df2. We set replace = FALSE so no entries were repeated. Then we saved it to a csv file titled “df2.csv”.
library(dplyr)
#df2 <- sample_n(df, size = 50, replace = FALSE)
# save as csv
#write.csv(df2, "df2.csv")
Using the dplyr package, we selected the columns needed for analysis from df2. We had no use for the columns titled Ticket.Created.Date…Time, Ticket.Last.Updated.Date…Time, Ticket.Closed.Date…Time, Ticket.Status, X.Coordinate, and Y.Coordinate, among others; see the code below for the other columns we removed from the data frame. We saved the selected columns to a new R object called cdf, for clean data frame. The dplyr package is handy for cleaning because of functions like select() and filter(), and the pipe operator %>% makes it easy to chain several operations. We shortened the original column names and set them all to lower case. Using the functions fix_year(), fix_month(), and convert_month() from the Miami311p package, we split the character strings in the created column into two additional columns, year and month, and then removed the created column. Lastly, we decided it best to save the data at every stage in case we needed to retrace our steps, and therefore created a new csv file with the clean data.
df2 <- read.csv("df2.csv")
cdf<- df2 %>% select("Ticket.ID", "Issue.Type", "City", "Neighborhood...District...Ward...etc.", "Created.Year.Month", "Longitude", "Latitude", "Method.Received", "Goal.Days", "Actual.Completed.Days")
colnames(cdf) <- c("id", "issue.type", "city", "district", "created","longitude", "latitude", "method", "goal.days", "actual.days")
library(Miami311p)
#create vectors of months and years
year <- fix_year(cdf$created)
## [1] "2013" "2013" "2015" "2017" "2014" "2015" "2015" "2018" "2016" "2014"
## [11] "2013" "2015" "2017" "2016" "2015" "2015" "2015" "2015" "2017" "2017"
## [21] "2014" "2016" "2015" "2017" "2017" "2016" "2014" "2014" "2015" "2013"
## [31] "2015" "2014" "2015" "2017" "2015" "2013" "2015" "2017" "2014" "2017"
## [41] "2017" "2014" "2017" "2016" "2017" "2016" "2014" "2015" "2015" "2017"
month <- fix_month(cdf$created)
## [1] 6 7 6 1 4 1 12 1 6 2 11 8 5 10 11 1 3 8 6 1 4 4 10
## [24] 7 11 1 3 1 10 8 5 9 12 8 6 11 7 2 7 12 11 2 8 5 8 5
## [47] 8 3 11 8
month <- convert_month(month)
## [1] June July June January April January December
## [8] January June February November August May October
## [15] November January March August June January April
## [22] April October July November January March January
## [29] October August May September December August June
## [36] November July February July December November February
## [43] August May August May August March November
## [50] August
# bind month and years to clean data frame
cdf$year <- factor(year)
cdf$month <- factor(month)
#check structure
str(cdf)
## 'data.frame': 50 obs. of 12 variables:
## $ id : Factor w/ 50 levels "13-10037389",..: 1 2 20 37 10 16 30 50 35 7 ...
## $ issue.type : Factor w/ 30 levels "ABANDONED PROPERTY / VEHICLE",..: 17 13 29 8 27 30 20 10 19 6 ...
## $ city : Factor w/ 9 levels "City_of_Hialeah",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ district : Factor w/ 13 levels "District 1","District 10",..: 2 10 2 13 6 3 3 13 2 13 ...
## $ created : int 20136 20137 20156 20171 20144 20151 201512 20181 20166 20142 ...
## $ longitude : num -80.4 -80.3 -80.3 -80.4 -80.2 ...
## $ latitude : num 25.7 25.7 25.7 25.6 25.9 ...
## $ method : Factor w/ 9 levels "EMAIL","INHOUSE",..: 9 9 5 5 5 5 9 9 5 9 ...
## $ goal.days : int 90 90 10 4 30 30 30 180 30 120 ...
## $ actual.days: int 155 16 1 3 3 32 0 NA 0 1 ...
## $ year : Factor w/ 6 levels "2013","2014",..: 1 1 3 5 2 3 3 6 4 2 ...
## $ month : Factor w/ 12 levels "April","August",..: 7 6 7 5 1 5 3 5 7 4 ...
# check for the column position of "created", then remove it and reassign cdf
names(cdf)
## [1] "id" "issue.type" "city" "district" "created"
## [6] "longitude" "latitude" "method" "goal.days" "actual.days"
## [11] "year" "month"
cdf<- cdf[ , -5]
names(cdf)
## [1] "id" "issue.type" "city" "district" "longitude"
## [6] "latitude" "method" "goal.days" "actual.days" "year"
## [11] "month"
# examine new product
head(cdf , 3)
## id issue.type city
## 1 13-10037389 RIGHT OF WAY - MAINTENANCE Miami_Dade_County
## 2 13-10056138 JUNK AND TRASH / OVERGROWTH Miami_Dade_County
## 3 15-10188131 VISUAL OBSTRUCTION SAFETY ISSUE (RAAM) Miami_Dade_County
## district longitude latitude method goal.days actual.days year month
## 1 District 10 -80.40758 25.70405 XTERFACE 90 155 2013 June
## 2 District 6 -80.29507 25.74856 XTERFACE 90 16 2013 July
## 3 District 10 -80.34385 25.69773 PHONE 10 1 2015 June
# save to csv
# write.csv(cdf, "Clean df2.csv")
Once the data was cleaned we needed to narrow our scope of analysis. After calling str(cdf) we noticed that the city column had 37 levels, district had 14 levels, and issue.type, our main variable for analysis, had 205 levels. We imported the cleaned data set into Excel to take a closer look. Excel allowed us to view the data set all at once and make use of its filter function to examine the levels of each categorical variable. The picture below demonstrates the complexity of the data: there are sublevels within levels in the variable issue.type. For example, there is the top level "Traffic" with multiple sublevels such as "Signal Ped Crossing Time Too Short" and "Sign Down Damaged Faded Missing (Other Than Control Sign)".
Example of Excel Filter Function
The following sections use a data set subsetted from df using Excel. The column names have not been changed as in the initial cleaning presented above.
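The Excel subsetting can also be reproduced in R with dplyr. A sketch with a toy data frame standing in for df (the real one has over 641,000 rows and 205 issue types):

```r
library(dplyr)

# Toy stand-in for df, using issue type levels that appear in the 311 data
toy <- data.frame(Issue.Type = c("ANIMAL BITE TO A PERSON",
                                 "PIT BULL INVESTIGATION",
                                 "JUNK AND TRASH / OVERGROWTH"))

# Keep only the two issue types analyzed in the sections below
pb_toy <- toy %>%
  filter(Issue.Type %in% c("ANIMAL BITE TO A PERSON",
                           "PIT BULL INVESTIGATION"))

nrow(pb_toy)
```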
library(ggplot2)
## Need help? Try the ggplot2 mailing list:
## http://groups.google.com/group/ggplot2.
Read the data into an R object.
pb<- read.csv("C:/Users/pietr/Desktop/Data/311 Bites and Pits.csv")
Create a plot using the ggplot2 package. The base of every ggplot2 plot is a call to ggplot(). Within the aes() argument we specify the x axis; the y axis will be the count. In this case, x = Created.Year.Month. We want to color the bars by Issue Type, so we pass that as our fill argument.
ggplot(pb, aes(`Created.Year.Month`, fill = `Issue.Type`)) +
# dodge will unstack the bars and put them side by side
geom_bar(position = "dodge")+
# x axis title
xlab('Year')+
# y axis title
ylab("Count")+
# title of graph
ggtitle("Animal Bite to a Person and Pit Bull Investigations 2013 - 2017")+
# add a legend: name = "title of legend", values = c("colors", "of", "legend"))
scale_fill_manual(name = "Issue Type", values = c("rosybrown3", "cornflowerblue"))+
# remove the default grey background
theme_minimal()+
#change legend position on graph
theme(legend.position = "top")+
#selects title of the plot, selects text of title, hjust = (side of graph 0 - 1)
# hjust = 0.5, will center the title
theme(plot.title = element_text(hjust = 0.5, size =15))
setwd("C:/Users/pietr/Desktop/Data")
library(dplyr)
library(ggplot2)
library(devtools)
## Warning: package 'usethis' was built under R version 3.4.4
library(stringr)
library(maps)
#install.packages("mapdata")
library("mapdata")
## Warning: package 'mapdata' was built under R version 3.4.4
#install.packages("ggmap")
#library(ggmap)
# Create your data object
pb<- read.csv("C:/Users/pietr/Desktop/Data/311 Bites and Pits.csv")
# Source: http://eriqande.github.io/rep-res-web/lectures/making-maps-with-R.html
# Create the Florida map
states <- map_data("state")
fl_map <- subset(states, region=="florida")
head(fl_map)
## long lat group order region subregion
## 1462 -85.01548 30.99702 9 1462 florida <NA>
## 1463 -84.99829 30.96264 9 1463 florida <NA>
## 1464 -84.97537 30.92253 9 1464 florida <NA>
## 1465 -84.94672 30.89962 9 1465 florida <NA>
## 1466 -84.94099 30.88815 9 1466 florida <NA>
## 1467 -84.94672 30.85951 9 1467 florida <NA>
# Get the counties
counties <- map_data("county")
# Filter out the county of Miami-Dade
md <- counties %>% filter(subregion == "miami-dade")
md
# Plot Miami-Dade
md_map <-ggplot(md, mapping = aes(x= long, y= lat))+
coord_fixed(1.3) +
geom_polygon(color= "black", fill="cornsilk2")+
xlab("Longitude")+
ylab("Latitude")+
ggtitle("Miami Dade County\n2013-2017")+
theme_dark()+
theme(text = element_text(size = 15))
md_map
# Split the data to be plotted
set.seed(101)
pb2 <- sample_n(pb, size= 1000)
pb2$Issue.Type<- factor(pb2$Issue.Type)
pb2.split<- split(pb2, pb2$Issue.Type)
pb2b<- pb2.split$`ANIMAL BITE TO A PERSON`
pb2p<-pb2.split$`PIT BULL INVESTIGATION`
# Plot the data points by latitude and longitude.
# Panel by year, color by issue type
md_map2 <- md_map+
geom_point(data = pb2b, aes(Longitude, Latitude, color= "rosybrown3"),alpha = 0.5)+
geom_point(data = pb2p, aes(Longitude, Latitude, color= "cornflowerblue"), alpha = 0.5)+
coord_fixed(1.3)+
facet_grid(.~Created.Year.Month)+
theme(plot.title = element_text(hjust = 0.5, size =30))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
scale_colour_manual(name = 'Issue Type', guide = "legend",
values =c('rosybrown3'='rosybrown3','cornflowerblue'='cornflowerblue'),
labels = c('ANIMAL BITE TO A PERSON','PIT BULL INVESTIGATION'))+
theme(legend.position = "top",
legend.text.align = 0.5,
legend.text = element_text(size = 8),
legend.title = element_text(size= 12),
legend.key=element_rect(fill = "white"))+
guides(colour = guide_legend(title.position = "top", title.hjust = 0.5))
md_map2
## Warning: Removed 5 rows containing missing values (geom_point).
The map created in R lacked readability. GIS provided a better way to display the spatial data. Below are some of the resulting graphics.
GIS Image of Animal Bites (white) and Pit Bull Investigations (black)
GIS Image of all Instances of Animal Bites and Pit Bull Investigations color coded by District
For this map, the plotted symbols are Animal Bites to a Person and Pitbull Investigations in 2017. The yellow symbols represent Animal Bites to a Person; the grey symbols represent Pitbull Investigations. The heatmap also shows the concentration of incidents in various areas.
This map shows the concentration by district. The heatmap shows concentration per area.
The next two images below are a side by side comparison showing which areas had a high, medium, or low concentration of incidents.
GIS Image Pitbull Investigations (2013 - 2017).
Our team wanted to find the cities in Miami Dade County appearing most frequently among "Pit Bull Investigation" calls for the year 2017. [2]
Using the data set pb, filter for the year 2017. We saved the result to the object pb2017.
pb2017<- pb %>% filter(`Created.Year.Month` == 2017)
Since we wanted to plot the most frequent cities, we first plotted all the cities using the plotly package. Plotly produces interactive graphs, so we could point and click directly to find values. This graph showed that an overwhelming number of occurrences came from "Miami_Dade_County". Because of this, we decided to remove that city from our data set. In addition, "Miami_Dade_County" is not a city of the county itself; we believe it is assigned to cases with undocumented cities, or is a default value.
g2 <- ggplot(pb2017, aes(City, fill = `Issue.Type`)) +
geom_bar(position = "dodge")
library(plotly)
ggplotly(g2)
Convert City to a factor and split pb2017 by city. This creates a list with one element per city, each holding that city's data. This allowed us to remove a city by deleting its element from the list.
pb2017$City<- factor(pb2017$City)
pbsplit <- split(pb2017, pb2017$City)
tail(names(pbsplit),10)
## [1] "Miami_Dade_County" "Miami_Shores_Village"
## [3] "North_Bay_Village" "Town_of_Bay_Harbor_Islands"
## [5] "Town_of_Cutler_Bay" "Town_of_Medley"
## [7] "Village_of_Biscayne_Park" "Village_of_El_Portal"
## [9] "Village_of_Key_Biscayne" "Village_of_Virginia_Gardens"
Calling names(pbsplit) displayed the position of “Miami_Dade_County” in the list. The code below removes it.
pbsplit[[24]]<- NULL
tail(names(pbsplit), 10)
## [1] "City_of_West_Miami" "Miami_Shores_Village"
## [3] "North_Bay_Village" "Town_of_Bay_Harbor_Islands"
## [5] "Town_of_Cutler_Bay" "Town_of_Medley"
## [7] "Village_of_Biscayne_Park" "Village_of_El_Portal"
## [9] "Village_of_Key_Biscayne" "Village_of_Virginia_Gardens"
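Indexing by position, as in pbsplit[[24]] above, works but is fragile if the city ordering ever changes. Removing the element by name is a safer equivalent; a sketch with a toy list standing in for pbsplit:

```r
# Toy stand-in for pbsplit: one element per city
citysplit <- list(City_of_Miami     = 1:3,
                  Miami_Dade_County = 4:6,
                  Town_of_Medley    = 7:9)

# Remove the catch-all entry by name instead of by position
citysplit[["Miami_Dade_County"]] <- NULL

names(citysplit)
```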
Combine the list back into a data frame. Character strings will be converted to factors.
pbmerge <- do.call(rbind.data.frame, pbsplit)
# Just because we thought it important to know how much data we were missing
sum(is.na(pbmerge))
## [1] 6
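Since dplyr is already loaded, bind_rows() is an equivalent (and often faster) way to recombine the list. A toy sketch:

```r
library(dplyr)

# Toy list of per-city data frames, standing in for pbsplit
parts <- list(data.frame(City = "A", n = 1),
              data.frame(City = "B", n = 2))

# Same rows as do.call(rbind.data.frame, parts)
merged <- bind_rows(parts)

nrow(merged)
```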
Split the data frame by Issue Type; we called this object pbtype.split. Then assign the list elements to bite.split for animal bites and pit.split for pit bull investigations.
pbtype.split<- split(pbmerge, pbmerge$Issue.Type)
bite.split<- pbtype.split$`ANIMAL BITE TO A PERSON`
pit.split<- pbtype.split$`PIT BULL INVESTIGATION`
Extract the City column from bite.split and pit.split. This is the text information we used to create the word clouds. Save them as tab-delimited files in two separate empty folders in the working directory.
bite.city <- as.character(bite.split$City)
pit.city <- as.character(pit.split$City)
# save to txt file
#write.table(bite.city, "bite.txt", sep="\t")
#write.table(pit.city, "pit.txt", sep="\t")
Download the required packages.
library(tm)
library(wordcloud)
Create the corpus.
bite.text <- readLines("C:/Users/pietr/Desktop/Data/LIS 5802 Fianl Project bite word cloud/bite.txt")
bite.corpus <- Corpus(VectorSource(bite.text))
Clean the text. From inspecting the head of the bite corpus, there were many extra characters that we needed to remove, such as "" and numbers. The toSpace function was taken from [2]. There is also other text cleaning code for patterns we thought we would need to remove, but we chose not to use it because it separated the phrases in the word cloud.
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
bite.corpus <- tm_map(bite.corpus, toSpace, "\t")
# bite.corpus <- tm_map(bite.corpus, toSpace, "_")
bite.corpus <- tm_map(bite.corpus, removeNumbers)
bite.corpus <- tm_map(bite.corpus, removePunctuation)
# bite.corpus <- tm_map(bite.corpus, content_transformer(tolower))
Create the term document matrix. The tm package does all the work here. We saved the matrix as bite.tdm.
bite.tdm <- TermDocumentMatrix(bite.corpus)
bite.matrix <- as.matrix(bite.tdm)
bite.sort <- sort(rowSums(bite.matrix), decreasing = TRUE)
as.vector(bite.sort)
## [1] 296 112 64 63 53 52 41 39 35 30 29 18 18 17 13 13 12
## [18] 12 12 9 8 5 5 5 4 3 3 3 1 1 1
Create a vector with the populations of each city represented in the term document matrix.
# Divide by population
Frequency<- bite.sort
pop <- as.vector(c(453579, 236387, 44707, 45704, 60512, 107167, 46780,
58786, 87779, 41523, 23410, 35762, 18223, 11245, 11657,
13499, 21744, 29361, 15219, 12344, 13809, 5965, 10493,
838, 20832, 5744, 7137, 5628, 3055, 2325, 2375))
Combine the city names, frequencies, and populations into a table.
City <- names(bite.sort)
bite.chart <- cbind(City, Frequency, pop)
bite.chart <- as.data.frame(bite.chart)
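Note that cbind() coerces everything to character, which is why Frequency and pop come back as factors and need converting below. Building the data frame directly avoids that round trip; a sketch with toy stand-ins for bite.sort and pop:

```r
# Toy stand-ins for bite.sort and pop
bite.sort.toy <- c(cityofmiami = 296, cityofhialeah = 112)
pop.toy <- c(453579, 236387)

# Constructed directly, the columns keep their numeric types,
# so no factor-to-numeric conversion is needed afterwards
bite.chart.toy <- data.frame(City      = names(bite.sort.toy),
                             Frequency = as.numeric(bite.sort.toy),
                             pop       = pop.toy)

str(bite.chart.toy)
```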
Extract the frequency and population.
str(bite.chart)
## 'data.frame': 31 obs. of 3 variables:
## $ City : Factor w/ 31 levels "cityofaventura",..: 8 5 26 3 7 10 2 13 9 14 ...
## ..- attr(*, "names")= chr "cityofmiami" "cityofhialeah" "townofcutlerbay" "cityofdoral" ...
## $ Frequency: Factor w/ 21 levels "1","112","12",..: 8 2 19 18 17 16 14 12 11 10 ...
## ..- attr(*, "names")= chr "cityofmiami" "cityofhialeah" "townofcutlerbay" "cityofdoral" ...
## $ pop : Factor w/ 31 levels "10493","107167",..: 21 14 20 22 28 2 23 26 31 19 ...
## ..- attr(*, "names")= chr "cityofmiami" "cityofhialeah" "townofcutlerbay" "cityofdoral" ...
# as.character() first: as.numeric() on a factor returns level codes, not values
bite.chart$Frequency <- as.numeric(as.character(bite.chart$Frequency))
bite.chart$pop <- as.numeric(as.character(bite.chart$pop))
freq<- bite.chart$Frequency
n <- (freq/(pop))
bite.chart$n <- n
attach(bite.chart)
## The following objects are masked _by_ .GlobalEnv:
##
## City, Frequency, n, pop
bite.chart2 <- bite.chart[order(-n),]
Fix the frequency data for plotting. This pairs the names of the cities with their frequencies. We subsetted the 5 cities with the highest rates, denoted fc, from the data.
f <- as.numeric(as.vector(bite.chart2[1:5,4]))
c <- as.vector(bite.chart2[1:5,1])
fc<- as.data.frame(cbind(f,c))
fc$f<-as.numeric(f)
fc$c <- c("Town of Medley", "City of West Miami", "Village of Key Biscayne", "Town of Bay Harbor Islands", "City of Surfside")
Plot the frequencies in a bar chart.
ggplot(fc, aes(c, f))+
geom_bar(stat = "identity", fill="rosybrown3")+
xlab("City")+
ylab("Frequency per Population")+
ggtitle("Cities with Highest Frequency of 311 Calls for Animal Bite to a Person for 2017")+
theme_minimal()
# Create the corpus
pit.text <- readLines("C:/Users/pietr/Desktop/Data/LIS 5802 Final Project pit word cloud/pit.txt")
pit.corpus <- Corpus(VectorSource(pit.text))
inspect(pit.corpus)
# Clean the text, as for the bite corpus above (toSpace from [2])
pit.corpus <- tm_map(pit.corpus, toSpace, "\t")
pit.corpus <- tm_map(pit.corpus, removeNumbers)
pit.corpus <- tm_map(pit.corpus, removePunctuation)
#pit.corpus <- tm_map(pit.corpus, removeWords, c("CityofMiami")) # Remove after because of high population
inspect(pit.corpus)
# Term Document Matrix
pit.tdm <- TermDocumentMatrix(pit.corpus)
pit.matrix <- as.matrix(pit.tdm)
pit.sort <- sort(rowSums(pit.matrix), decreasing = TRUE)
as.vector(pit.sort)
# Combine Term Document Matrix into a table
City <- names(pit.sort)
Frequency <- as.numeric(pit.sort)
pit.chart <- cbind(City, Frequency)
pit.chart <- as.data.frame(pit.chart)
# Account for population
Frequency<- pit.sort
pop <- as.vector(c(453579, 107167, 60512, 40286, 87779, 58786,
58786, 41523, 29361, 11245, 15219, 46780, 5744,
45704, 21744, 13809, 23410, 11657, 10493, 35762,
18223, 5965, 2325))
str(pit.chart)
# as.character() first: as.numeric() on a factor returns level codes, not values
pit.chart$Frequency <- as.numeric(as.character(pit.chart$Frequency))
pit.chart$pop<- as.numeric(pop)
freq<- pit.chart$Frequency
n <- freq / pop  # per-capita rate, matching the bite chart above
pit.chart$n <- n
attach(pit.chart)
## The following objects are masked _by_ .GlobalEnv:
##
## City, Frequency, n, pop
## The following objects are masked from bite.chart:
##
## City, Frequency, n, pop
pit.chart2 <- pit.chart[order(-n),]
# Fix frequency Data for plotting
# use the sorted pit.chart2; the unsorted pit.chart would not give the top 5
f <- as.numeric(as.vector(pit.chart2[1:5,4]))
c <- as.vector(pit.chart2[1:5,1])
fc<- as.data.frame(cbind(f,c))
fc$f<-as.numeric(f)
fc
fc$c <- c("City of Miami", "City of Miami Gardens", "City of Homestead", "City of Hialeah", "Town of Cutler Bay")
# Plot the frequencies
ggplot(fc, aes(c, f))+
geom_bar(stat = "identity", fill="cornflowerblue")+
xlab("City")+
ylab("Frequency per Population")+
ggtitle("Cities with Highest Frequency of 311 Calls for Pit Bull Investigations for 2017")+
theme_minimal()
While completing the project we came face to face with many issues. Our main limitation was overcomplicating the data analysis and getting lost in it. For example, the word clouds below were made but did not account for population; although they provided another way of displaying the frequency data, they were not needed. Also, the description of the Animal Bite 311 calls is missing from the data set, so there is no way to tell which animals were involved in the actual biting incident. If the description of each call were available, we could use only the ones that regarded pitbull bites and then compare their locations to those of the Pit Bull Investigation variable.
Word Cloud of Animal Bite frequency before the population was accounted for (2017)
Word Cloud for Pitbull Investigation frequency before population was accounted for (2017)
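For reference, the word clouds above were produced with the wordcloud package, along the following lines. This is a sketch with toy frequencies standing in for the sorted city counts (bite.sort) computed earlier:

```r
library(wordcloud)

# Toy frequencies standing in for bite.sort
freqs <- c(cityofmiami = 296, cityofhialeah = 112, townofcutlerbay = 64)

# Word size scales with raw frequency; min.freq keeps rare cities out
wordcloud(words = names(freqs), freq = freqs, min.freq = 1,
          colors = "rosybrown3")
```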
Certain areas of Miami Dade County had a high concentration of Animal Bite to a Person calls but not Pitbull Investigation calls. As mentioned before, having the descriptions of the calls could have given more insight into the data. The pit bull ordinance could also be an underlying factor in understanding why there were lower concentrations of pitbull investigations in certain areas. In conclusion, while we were able to see which areas had a higher concentration of both incidents, we could not further analyze the data or gain a better understanding of why there were higher concentrations in some areas and not in others.